---
title: "S3Downloader"
id: s3downloader
slug: "/s3downloader"
description: "`S3Downloader` downloads files from AWS S3 buckets to the local filesystem and enriches documents with the local file path."
---

# S3Downloader

`S3Downloader` downloads files from AWS S3 buckets to the local filesystem and enriches documents with the local file path.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before File Converters or Routers that need local file paths |
| **Mandatory init variables** | `file_root_path`: Path where files will be downloaded. Can be set with the `FILE_ROOT_PATH` env var.  <br /> <br />`aws_access_key_id`: AWS access key ID. Can be set with the `AWS_ACCESS_KEY_ID` env var.  <br /> <br />`aws_secret_access_key`: AWS secret access key. Can be set with the `AWS_SECRET_ACCESS_KEY` env var.  <br /> <br />`aws_region_name`: AWS region name. Can be set with the `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | `documents`: A list of documents, each containing the name of the file to download in its metadata. |
| **Output variables** | `documents`: A list of documents enriched with the local file path in `meta['file_path']`. |
| **API reference** | [S3Downloader](/reference/integrations-amazon-bedrock) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_bedrock |

</div>

## Overview

`S3Downloader` downloads files from AWS S3 buckets to your local filesystem and enriches Document objects with the local file path. This component is useful for pipelines that need to process files stored in S3, such as PDFs, images, or text files.

The component supports AWS authentication through environment variables by default. You can set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` environment variables. Alternatively, you can pass credentials directly at initialization using the [Secret API](../../concepts/secret-management.mdx):

```python
from haystack.utils import Secret
from haystack_integrations.components.downloaders.s3 import S3Downloader

downloader = S3Downloader(
    aws_access_key_id=Secret.from_token("<your-access-key-id>"),
    aws_secret_access_key=Secret.from_token("<your-secret-access-key>"),
    aws_region_name=Secret.from_token("<your-region>"),
    file_root_path="/path/to/download/directory"
)
```

The component downloads files in parallel, using up to `max_workers` threads (default: 32), to speed up processing of large document sets. Downloaded files are cached locally; when the cache exceeds `max_cache_size` files (default: 100), the least recently accessed files are removed. Files already present in the cache are not downloaded again: the component only touches them to update their access time.
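The eviction policy described above can be pictured with a minimal pure-Python sketch. This is only an illustration of least-recently-accessed eviction based on file access times, not the component's actual implementation; the helper name `evict_least_recently_accessed` is hypothetical:

```python
import os
import tempfile
from pathlib import Path

def evict_least_recently_accessed(cache_dir: Path, max_cache_size: int) -> None:
    """Remove least-recently-accessed files until at most max_cache_size remain."""
    files = sorted(cache_dir.iterdir(), key=lambda p: p.stat().st_atime)
    for stale in files[: max(0, len(files) - max_cache_size)]:
        stale.unlink()

# Demonstrate with a temporary cache of 5 files and a limit of 3
cache = Path(tempfile.mkdtemp())
for i in range(5):
    path = cache / f"file_{i}.txt"
    path.write_text("data")
    os.utime(path, (i, i))  # fake access times: file_0 is the oldest

evict_least_recently_accessed(cache, max_cache_size=3)
remaining = sorted(p.name for p in cache.iterdir())
print(remaining)  # the two least recently accessed files are gone
```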

:::info Required Configuration

The component requires two critical configurations:

1. `file_root_path` parameter or `FILE_ROOT_PATH` environment variable: Specifies the directory files are downloaded to. If the directory doesn't exist, it is created when `warm_up()` is called.
2. `S3_DOWNLOADER_BUCKET` environment variable: Specifies which S3 bucket to download files from.
:::

The optional environment variable `S3_DOWNLOADER_PREFIX` can be set to prepend a prefix to all generated S3 keys.
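As a rough illustration, the prefix is prepended to the key generated for each file. This sketch assumes the prefix is a plain string prefix, including any trailing slash you supply; `build_s3_key` is a hypothetical helper, not the component's code:

```python
import os

def build_s3_key(file_name: str) -> str:
    """Combine the optional S3_DOWNLOADER_PREFIX with a file name (illustrative only)."""
    prefix = os.environ.get("S3_DOWNLOADER_PREFIX", "")
    return f"{prefix}{file_name}"

os.environ["S3_DOWNLOADER_PREFIX"] = "projects/2025/"
print(build_s3_key("report.pdf"))  # projects/2025/report.pdf
```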

### File Extension Filtering

You can use the `file_extensions` parameter to download only specific file types, reducing unnecessary downloads and processing time. For example, `file_extensions=[".pdf", ".txt"]` downloads only PDF and TXT files while skipping others.

### Custom S3 Key Generation

By default, the component uses the `file_name` from Document metadata as the S3 key. If your S3 file structure doesn't match the file names in metadata, you can provide an optional `s3_key_generation_function` to customize how S3 keys are generated from Document metadata.

## Usage

You need to install the `amazon-bedrock-haystack` package to use `S3Downloader`:

```shell
pip install amazon-bedrock-haystack
```

### On its own

Before running the examples, ensure you have set the required environment variables:

```shell
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="<your-region>"
export S3_DOWNLOADER_BUCKET="<your-bucket-name>"
```

Here's how to use `S3Downloader` to download files from S3:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

# Create documents with file names in metadata
documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "data.txt"}),
]

# Initialize the downloader
downloader = S3Downloader(file_root_path="/tmp/s3_downloads")

# Warm up the component
downloader.warm_up()

# Download the files
result = downloader.run(documents=documents)

# Access the downloaded files
for doc in result["documents"]:
    print(f"File downloaded to: {doc.meta['file_path']}")
```

With file extension filtering:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "image.png"}),
    Document(meta={"file_name": "data.txt"}),
]

# Only download PDF files
downloader = S3Downloader(
    file_root_path="/tmp/s3_downloads",
    file_extensions=[".pdf"]
)

downloader.warm_up()

result = downloader.run(documents=documents)

# Only report.pdf is downloaded
print(f"Downloaded {len(result['documents'])} file(s)")
# Output: Downloaded 1 file(s)
```

With custom S3 key generation:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

def custom_s3_key_function(document: Document) -> str:
    """Generate S3 key from custom metadata."""
    folder = document.meta.get("folder", "default")
    file_name = document.meta.get("file_name")
    if not file_name:
        raise ValueError("Document must have 'file_name' in metadata")
    return f"{folder}/{file_name}"

documents = [
    Document(meta={"file_name": "report.pdf", "folder": "reports/2025"}),
]

downloader = S3Downloader(
    file_root_path="/tmp/s3_downloads",
    s3_key_generation_function=custom_s3_key_function
)

downloader.warm_up()
result = downloader.run(documents=documents)
```

### In a pipeline

Here's an example of using `S3Downloader` in a document processing pipeline:

```python
from haystack import Pipeline
from haystack.components.converters import PDFMinerToDocument
from haystack.components.routers import DocumentTypeRouter
from haystack.dataclasses import Document

from haystack_integrations.components.downloaders.s3 import S3Downloader

# Create a pipeline
pipe = Pipeline()

# Add S3Downloader to download files from S3
pipe.add_component(
    "downloader",
    S3Downloader(
        file_root_path="/tmp/s3_downloads",
        file_extensions=[".pdf", ".txt"]
    )
)

# Route documents by file type
pipe.add_component(
    "router",
    DocumentTypeRouter(
        file_path_meta_field="file_path",
        mime_types=["application/pdf", "text/plain"]
    )
)

# Convert PDFs to documents
pipe.add_component("pdf_converter", PDFMinerToDocument())

# Connect components
pipe.connect("downloader.documents", "router.documents")
pipe.connect("router.application/pdf", "pdf_converter.documents")

# Create documents with S3 file names
documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "summary.txt"}),
]

# Run the pipeline
result = pipe.run({"downloader": {"documents": documents}})
```

For a more complex example with image processing and LLM:

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.converters.image import DocumentToImageContent
from haystack.components.routers import DocumentTypeRouter
from haystack.dataclasses import Document

from haystack_integrations.components.downloaders.s3 import S3Downloader
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator

# Create documents with file names
documents = [
    Document(meta={"file_name": "chart.png"}),
    Document(meta={"file_name": "report.pdf"}),
]

# Create pipeline
pipe = Pipeline()

# Download files from S3
pipe.add_component(
    "downloader",
    S3Downloader(file_root_path="/tmp/s3_downloads")
)

# Route by document type
pipe.add_component(
    "router",
    DocumentTypeRouter(
        file_path_meta_field="file_path",
        mime_types=["image/png", "application/pdf"]
    )
)

# Convert images for LLM
pipe.add_component("image_converter", DocumentToImageContent(detail="auto"))

# Create chat prompt with template
template = """{% message role="user" %}
Answer the question based on the provided images.

Question: {{ question }}

{% for image in image_contents %}
{{ image | templatize_part }}
{% endfor %}
{% endmessage %}"""

pipe.add_component(
    "prompt_builder",
    ChatPromptBuilder(template=template)
)

# Generate response
pipe.add_component(
    "llm",
    AmazonBedrockChatGenerator(model="anthropic.claude-3-haiku-20240307-v1:0")
)

# Connect components
pipe.connect("downloader.documents", "router.documents")
pipe.connect("router.image/png", "image_converter.documents")
pipe.connect("image_converter.image_contents", "prompt_builder.image_contents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# Run pipeline
result = pipe.run({
    "downloader": {"documents": documents},
    "prompt_builder": {"question": "What information is shown in the chart?"}
})
```
